Multiple sequence alignment

A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations (single amino acid or nucleotide changes) that appear as differing characters in a single alignment column, and insertion or deletion mutations (indels or gaps) that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

Multiple sequence alignment also refers to the process of aligning such a sequence set. Because three or more sequences of biologically relevant length can be difficult and are almost always time-consuming to align by hand, computational algorithms are used to produce and analyze the alignments. MSAs require more sophisticated methodologies than pairwise alignment because they are more computationally complex. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive.

Contents

Dynamic programming and computational complexity

A direct method for producing an MSA uses the dynamic programming technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a gap penalty and a substitution matrix assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment.[1]

For n individual sequences, the naive method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise sequence alignment. The search space thus increases exponentially with increasing n and is also strongly dependent on sequence length. Expressed with the big O notation commonly used to measure computational complexity, a naïve MSA takes O(LengthNseqs) time to produce. To find the global optimum for n sequences this way has been shown to be an NP-complete problem.[2][3][4] In 1989, based on Carrillo-Lipman Algorithm,[5] Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space.[6] In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The MSA program optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called sum of pair score) and has been implemented in a software program for constructing multiple sequence alignments.[7]

Progressive alignment construction

The most widely used approach to multiple sequence alignments uses a heuristic search known as progressive technique (also known as the hierarchical or tree method), that builds up a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a tree, called a guide tree, and a second step in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial guide tree is determined by an efficient clustering method such as neighbor-joining or UPGMA, and may use distances based on the number of identical two letter sub-sequences (as in FASTA rather than a dynamic programming alignment).[8]

Progressive alignments are not guaranteed to be globally optimal. The primary problem is that when errors are made at any stage in growing the MSA, these errors are then propagated through to the final result. Performance is also particularly bad when all of the sequences in the set are rather distantly related. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. This corrects for non-random selection of the sequences given to the alignment program.[8]

Progressive alignment methods are efficient enough to implement on a large scale for many (100s to 1000s) sequences. Progressive alignment services are commonly available on publicly accessible web servers so users need not locally install the applications of interest. The most popular progressive alignment method has been the Clustal family,[9] especially the weighted variant ClustalW[10] to which access is provided by a large number of web portals including GenomeNet, EBI, and EMBNet. Different portals or implementations can vary in user interface and make different parameters accessible to the user. ClustalW is used extensively for phylogenetic tree construction, in spite of the author's explicit warnings that unedited alignments should not be used in such studies and as input for protein structure prediction by homology modeling.

Another common progressive alignment method called T-Coffee[11] is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-Coffee calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that aligns each sequence of the pair to a third sequence. It uses the output from Clustal as well as another local alignment program LALIGN, which finds multiple regions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors.

Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and their true biological significance can be obscure. A semi-progressive method that improves alignment quality and does not use a lossy heuristic while still running in polynomial time has been implemented in the program PSAlign.[12]

Iterative methods

A set of methods to produce MSAs while reducing the errors inherent in progressive methods are classified as "iterative" because they work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA. One reason progressive methods are so strongly dependent on a high-quality initial alignment is the fact that these alignments are always incorporated into the final result - that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy. By contrast, iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a means of optimizing a general objective function such as finding a high-quality alignment score.[8]

A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" technique.[13] The software package PRRN/PRRP uses a hill-climbing algorithm to optimize its MSA alignment score[14] and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA.[8] PRRP performs best when refining an alignment previously constructed by a faster method.[8]

Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or sequence motifs without introducing a gap penalty.[15] The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. An alternative method that uses fast local alignments as anchor points or "seeds" for a slower global-alignment procedure is implemented in the CHAOS/DIALIGN suite.[15]

A third popular iteration-based method called MUSCLE (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences.[16] The distance measure is updated between iteration stages (although, in its original form, MUSCLE contained only 2-3 iterations depending on whether refinement was enabled).

Hidden Markov models

Hidden Markov models are probabilistic models that can assign likelihoods to all possible combinations of gaps, matches, and mismatches to determine the most likely MSA or set of possible MSAs. HMMs can produce a single highest-scoring output but can also generate a family of possible alignments that can then be evaluated for biological significance. HMMs can produce both global and local alignments. Although HMM-based methods have been developed relatively recently, they offer significant improvements in computational speed, especially for sequences that contain overlapping regions.[8]

Typical HMM-based methods work by representing an MSA as a form of directed acyclic graph known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns of an MSA. In this representation a column that is absolutely conserved (that is, that all the sequences in the MSA share a particular character at a particular position) is coded as a single node with as many outgoing connections as there are possible characters in the next column of the alignment. In the terms of a typical hidden Markov model, the observed states are the individual alignment columns and the "hidden" states represent the presumed ancestral sequence from which the sequences in the query set are hypothesized to have descended. An efficient search variant of the dynamic programming method, known as the Viterbi algorithm, is generally used to successively align the growing MSA to the next sequence in the query set to produce a new MSA.[17] This is distinct from progressive alignment methods because the alignment of prior sequences is updated at each new sequence addition. However, like progressive methods, this technique can be influenced by the order in which the sequences in the query set are integrated into the alignment, especially when the sequences are distantly related.[8]

Several software programs are available in which variants of HMM-based methods have been implemented and which are noted for their scalability and efficiency, although properly using an HMM method is more complex than using more common progressive methods. The simplest is POA (Partial-Order Alignment);[18] a similar but more generalized method is implemented in the packages SAM (Sequence Alignment and Modeling System).[17] and HMMER.[19] SAM has been used as a source of alignments for protein structure prediction to participate in the CASP structure prediction experiment and to develop a database of predicted proteins in the yeast species S. cerevisiae. HHsearch[20] is a software package for the detection of remotely related protein sequences based on the pairwise comparison of HMMs. A server running HHsearch (HHpred) was by far the fastest of the 10 best automatic structure prediction servers in the CASP7 and CASP8 structure prediction competitions.[21]

Genetic algorithms and simulated annealing

Standard optimization techniques in computer science - both of which were inspired by, but do not directly reproduce, physical processes - have also been used in an attempt to more efficiently produce quality MSAs. One such technique, genetic algorithms, has been used for MSA production in an attempt to broadly simulate the hypothesized evolutionary process that gave rise to the divergence in the query set. The method works by breaking a series of possible MSAs into fragments and repeatedly rearranging those fragments with the introduction of gaps at varying positions. A general objective function is optimized during the simulation, most generally the "sum of pairs" maximization function introduced in dynamic programming-based MSA methods. A technique for protein sequences has been implemented in the software program SAGA (Sequence Alignment by Genetic Algorithm)[22] and its equivalent in RNA is called RAGA.[23]

The technique of simulated annealing, by which an existing MSA produced by another method is refined by a series of rearrangements designed to find better regions of alignment space than the one the input alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes an objective function like the sum-of-pairs function. Simulated annealing uses a metaphorical "temperature factor" that determines the rate at which rearrangements proceed and the likelihood of each rearrangement; typical usage alternates periods of high rearrangement rates with relatively low likelihood (to explore more distant regions of alignment space) with periods of lower rates and higher likelihoods to more thoroughly explore local minima near the newly "colonized" regions. This approach has been implemented in the program MSASA (Multiple Sequence Alignment by Simulated Annealing).[24]

Motif finding

Motif finding, also known as profile analysis, is a method of locating sequence motifs in global MSAs that is both a means of producing a better MSA and a means of producing a scoring matrix for use in searching other sequences for similar motifs. A variety of methods for isolating the motifs have been developed, but all are based on identifying short highly conserved patterns within the larger alignment and constructing a matrix similar to a substitution matrix that reflects the amino acid or nucleotide composition of each position in the putative motif. The alignment can then be refined using these matrices. In standard profile analysis, the matrix includes entries for each possible character as well as entries for gaps.[8] Alternatively, statistical pattern-finding algorithms can identify motifs as a precursor to an MSA rather than as a derivation. In many cases when the query set contains only a small number of sequences or contains only highly related sequences, pseudocounts are added to normalize the distribution reflected in the scoring matrix. In particular, this corrects zero-probability entries in the matrix to values that are small but nonzero.

Blocks analysis is a method of motif finding that restricts motifs to ungapped regions in the alignment. Blocks can be generated from an MSA or they can be extracted from unaligned sequences using a precalculated set of common motifs previously generated from known gene families.[25] Block scoring generally relies on the spacing of high-frequency characters rather than on the calculation of an explicit substitution matrix. The BLOCKS server provides an interactive method to locate such motifs in unaligned sequences.

Statistical pattern-matching has been implemented using both the expectation-maximization algorithm and the Gibbs sampler. One of the most common motif-finding tools, known as MEME, uses expectation maximization and hidden Markov methods to generate motifs that are then used as search tools by its companion MAST in the combined suite MEME/MAST.[26][27]

Visualization and editing tools

The necessary use of heuristics for multiple alignment means that for an arbitrary set of proteins, there is always a good chance that an alignment will contain errors. These can arise because of unique insertions into one or more regions of sequences, or through some more complex evolutionary process leading to proteins that do not align easily by sequence alone. Multiple sequence alignment viewers enable alignments to be visually verified, often by inspecting the quality of alignment for annotated functional sites on two or more sequences. Many also enable the alignment to be edited to correct these (usually minor) errors, in order to obtain an optimal 'curated' alignment suitable for use in phylogenetic analysis or comparative modeling.[28]

Use in phylogenetics

Multiple sequence alignments can be used to create a phylogenetic tree.[29] This is made possible by two reasons. The first is because functional domains that are known in annotated sequences can be used for alignment in non-annotated sequences. The other is that conserved regions known to be functionally important can be found. This makes it possible for multiple sequence alignments to be used to analyze and find evolutionary relationships through homology between sequences. Point mutations and insertion or deletion events (called indels) can be detected.

Multiple sequence alignments can also be used to identify functionally important sites, such as binding sites, active sites, or sites corresponding to other key functions, by locating conserved domains. When looking at multiple sequence alignments, it is useful to consider different aspects of the sequences when comparing sequences. These aspects include identity, similarity, and homology. Identity means that the sequences have identical residues at their respective positions. On the other hand, similarity has to do with the sequences being compared having similar residues quantitatively. For example, in terms of nucleotide sequences, pyrimidines are considered similar to each other, as are purines. Similarity ultimately leads to homology, in that the more similar sequences are, the closer they are to being homologous. This homology in sequences, can then go on to help find common ancestry.[29]

See also

References

  1. ^ "Help with matrices used in sequence comparison tools". European Bioinformatics Institute. http://www.ebi.ac.uk/help/matrix.html. Retrieved March 3, 2010. 
  2. ^ Wang L, Jiang T (1994). "On the complexity of multiple sequence alignment". J Comput Biol 1 (4): 337–348. doi:10.1089/cmb.1994.1.337. PMID 8790475. 
  3. ^ Just W (2001). "Computational complexity of multiple sequence alignment with SP-score". J Comput Biol 8 (6): 615–23. doi:10.1089/106652701753307511. PMID 11747615. 
  4. ^ Elias, Isaac (2006). "Settling the intractability of multiple alignment". J Comput Biol 13 (7): 1323–1339. doi:10.1089/cmb.2006.13.1323. PMID 17037961. http://www.liebertonline.com/doi/abs/10.1089/cmb.2006.13.1323. 
  5. ^ Carrillo H, Lipman DJ,(1988) The Multiple Sequence Alignment Problem in Biology. SIAM Journal of Applied Mathematics, Vol.48, No. 5, 1073-1082
  6. ^ Lipman DJ, Altschul SF, Kececioglu JD (1989). "A tool for multiple sequence alignment". Proc Natl Acad Sci U S A 86 (12): 4412–4415. doi:10.1073/pnas.86.12.4412. PMC 287279. PMID 2734293. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=287279. 
  7. ^ "Genetic analysis software". National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html. Retrieved March 3, 2010. 
  8. ^ a b c d e f g h Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.
  9. ^ Higgins DG, Sharp PM (1988). "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer". Gene 73 (1): 237–244. doi:10.1016/0378-1119(88)90330-7. PMID 3243435. 
  10. ^ Thompson JD, Higgins DG, Gibson TJ (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice". Nucleic Acids Res 22 (22): 4673–4680. doi:10.1093/nar/22.22.4673. PMC 308517. PMID 7984417. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=308517. 
  11. ^ Notredame C, Higgins DG, Heringa J (2000). "T-Coffee: A novel method for fast and accurate multiple sequence alignment". J Mol Biol 302 (1): 205–217. doi:10.1006/jmbi.2000.4042. PMID 10964570. 
  12. ^ Sze SH, Lu Y, Yang Q (2006). "A polynomial time solvable formulation of multiple sequence alignment". J Comput Biol 13 (2): 309–319. doi:10.1089/cmb.2006.13.309. PMID 16597242. 
  13. ^ Hirosawa M, Totoki Y, Hoshida M, Ishikawa M (1995). "Comprehensive study on iterative algorithms of multiple sequence alignment". Comput Appl Biosci 11 (1): 13–18. doi:10.1093/bioinformatics/11.1.13. PMID 7796270. 
  14. ^ Gotoh O (1996). "Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments". J Mol Biol 264 (4): 823–38. doi:10.1006/jmbi.1996.0679. PMID 8980688. 
  15. ^ a b Brudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B (2003). "Fast and sensitive multiple alignment of large genomic sequences". BMC Bioinformatics 4: 66. 
  16. ^ Edgar RC (2004). "MUSCLE: multiple sequence alignment with high accuracy and high throughput". Nucleic Acids Research 32 (5): 1792–97. doi:10.1093/nar/gkh340. PMC 390337. PMID 15034147. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=390337. 
  17. ^ a b Hughey R, Krogh A (1996). "Hidden Markov models for sequence analysis: extension and analysis of the basic method". CABIOS 12 (2): 95–107. PMID 8744772. 
  18. ^ Grasso C, Lee C (2004). "Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems". Bioinformatics 20 (10): 1546–56. doi:10.1093/bioinformatics/bth126. PMID 14962922. 
  19. ^ Durbin R, Eddy S, Krogh A, Mitchison G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.
  20. ^ Söding J (2005). "Protein homology detection by HMM-HMM comparison". Bioinformatics 21 (7): 951–960. doi:10.1093/bioinformatics/bti125. PMID 15531603. 
  21. ^ Battey JN, Kopp J, Bordoli L, Read RJ, Clarke ND, Schwede T (2007). "Automated server predictions in CASP7". Proteins 69 (Suppl 8): 68–82. doi:10.1002/prot.21761. PMID 17894354. 
  22. ^ Notredame C, Higgins DG (1996). "SAGA: sequence alignment by genetic algorithm". Nucleic Acids Res 24 (8): 1515–24. doi:10.1093/nar/24.8.1515. PMC 145823. PMID 8628686. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=145823. 
  23. ^ Notredame C, O'Brien EA, Higgins DG (1997). "RAGA: RNA sequence alignment by genetic algorithm". Nucleic Acids Res 25 (22): 4570–80. doi:10.1093/nar/25.22.4570. PMC 147093. PMID 9358168. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=147093. 
  24. ^ Kim J, Pramanik S, Chung MJ (1994). "Multiple sequence alignment using simulated annealing". Comput Appl Biosci 10 (4): 419–26. PMID 7804875. 
  25. ^ Henikoff S, Henikoff JG (1991). "Automated assembly of protein blocks for database searching". Nucleic Acids Res 19 (23): 6565–6572. doi:10.1093/nar/19.23.6565. PMC 329220. PMID 1754394. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=329220. 
  26. ^ Bailey TL, Elkan C (1994). "Fitting a mixture model by expectation maximization to discover motifs in biopolymers". Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. Menlo Park, California: AAAI Press. pp. 28–36. 
  27. ^ Bailey TL, Gribskov M (1998). "Combining evidence using p-values: application to sequence homology searches". Bioinformatics 14 (1): 48–54. doi:10.1093/bioinformatics/14.1.48. PMID 9520501. 
  28. ^ "Manual editing and adjustment of MSAs". European Molecular Biology Laboratory. 2007. http://www.embl.de/~seqanal/MSAcambridgeGenetics2007/MSAmanualAdjustments/MSAmanualAdjustments.html. Retrieved March 7, 2010. 
  29. ^ a b Budd, Aidan (10 February 2009). "Multiple sequence alignment exercises and demonstrations". European Molecular Biology Laboratory. http://www.embl.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html. Retrieved June 30, 2010. 

Survey articles

External links

Lecture notes, tutorials, and courses